Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance

Authors

  • Hee-Seok Kim
  • Izzat El Hajj
  • John A Stratton
  • Wen-Mei W. Hwu
Abstract

Developing high-performance GPU code is labor intensive. Ideally, developers could recoup high GPU development costs by generating high-performance programs for CPUs and other architectures from the same source code. However, current OpenCL compilers for non-GPU targets do not fully exploit the optimizations in well-tuned GPU codes. To address this problem, we develop an OpenCL implementation that efficiently exploits GPU optimizations on multicore CPUs. Our implementation translates SIMT parallelism into SIMD vectorization and SIMT coalescing into cache-efficient access patterns. These translations are especially challenging when control divergence is present. Our system addresses divergence through a multi-tier vectorization approach based on dynamic convergence checking. The proposed approach outperforms existing industry implementations, achieving geometric mean speedups of 2.26× and 1.09× over AMD’s and Intel’s OpenCL implementations, respectively.
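
The idea of dynamic convergence checking can be illustrated with a small simulation. The abstract does not describe the actual tiers or the code the system generates, so the sketch below is only an assumption-laden illustration, not the authors' method: the names `SIMD_WIDTH`, `kernel_scalar`, and `run_group` are hypothetical, the lane width of 4 is arbitrary, and Python lists stand in for SIMD registers. Work-items are processed in lane-sized chunks; at a branch, the runtime checks whether all lanes agree on the predicate and takes a vector-style path if so, falling back to per-work-item execution only for chunks that actually diverge.

```python
# Hypothetical sketch of dynamic convergence checking.
# All names and the lane width are illustrative assumptions, not from the paper.

SIMD_WIDTH = 4  # assumed number of work-items packed into one vector


def kernel_scalar(x):
    """One work-item of a toy kernel with a data-dependent branch."""
    return x * 2 if x >= 0 else -x


def run_group(data):
    """Run the toy kernel over a work-group, one SIMD-width chunk at a time.

    Returns the outputs plus a count of how many chunks took the
    vector-style path versus the scalar fallback.
    """
    out = []
    stats = {"vector": 0, "scalar": 0}
    for base in range(0, len(data), SIMD_WIDTH):
        lanes = data[base:base + SIMD_WIDTH]
        preds = [x >= 0 for x in lanes]  # branch predicate per lane

        if all(preds):
            # Converged on the "then" side: one whole-vector operation.
            out.extend(x * 2 for x in lanes)
            stats["vector"] += 1
        elif not any(preds):
            # Converged on the "else" side: also a whole-vector operation.
            out.extend(-x for x in lanes)
            stats["vector"] += 1
        else:
            # Divergent chunk: fall back to executing each work-item alone.
            out.extend(kernel_scalar(x) for x in lanes)
            stats["scalar"] += 1
    return out, stats


result, stats = run_group([1, 2, 3, 4, -1, -2, -3, -4, 1, -1, 2, -2])
```

In this run the first two chunks are converged and the third is divergent, so only one chunk in three pays the scalar cost. A real implementation would replace the list comprehensions with SIMD instructions and could add intermediate tiers (for example, masked execution) between the two extremes shown here.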

Similar resources

Importance of Explicit Vectorization for CPU and GPU Software Performance

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU imp...

Accelerating High-Dimensional Nearest Neighbors for Video Search

The k-nearest neighbor algorithm (kNN) is a critical algorithm used extensively in fields such as Computer Vision, Robotics, and Machine Learning. In this work, we address the performance of FLANN, a popular kNN library, at the node-level by co-designing indexing and search algorithms with software support. We characterize, profile, and optimize FLANN for high-dimensionality (e.g., ≥ 4096) for ...

Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model

Current multi- and many-core computing typically involves multi-core Central Processing Units (CPU) and many-core Graphical Processing Units (GPU) whose architectures are distinctly different. To keep longevity of application codes, it is highly desirable to have a programming paradigm to support these current and future architectures. Open Computing Language (OpenCL) is created to address this p...

Tuning Principal Component Analysis for GRASS GIS on Multi-core and GPU Architectures

This paper presents optimizations to Principal Component Analysis (PCA) in GRASS GIS. The current implementation of PCA in GRASS is based on eigenvalue decomposition, which does not have high memory requirements but it can suffer from low runtime performance. In modern computers, significant performance improvements can be achieved by appropriately taking advantage of the memory configuration (...

Weld: Fast Data-Parallel Computation on Modern Hardware

Modern hardware is difficult to use efficiently, requiring complex optimizations like vectorization, loop blocking and load balancing to get good performance. As a result, many widely used data processing systems fall well short of peak hardware performance. We have developed Weld, an intermediate language and runtime that can run data-parallel computations efficiently on modern hardware. The c...

Publication date: 2015